Profiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs
نویسندگان
چکیده
MapReduce has emerged as a viable competitor to database systems in big data analytics. MapReduce programs are being written for a wide variety of application domains including business data processing, text analysis, natural language processing, Web graph and social network analysis, and computational science. However, MapReduce systems lack a feature that has been key to the historical success of database systems, namely, cost-based optimization. A major challenge here is that, to the MapReduce system, a program consists of black-box map and reduce functions written in some programming language like C++, Java, Python, or Ruby. We introduce, to our knowledge, the first Cost-based Optimizer for simple to arbitrarily complex MapReduce programs. We focus on the optimization opportunities presented by the large space of configuration parameters for these programs. We also introduce a Profiler to collect detailed statistical information from unmodified MapReduce programs, and a What-if Engine for fine-grained cost estimation. All components have been prototyped for the popular Hadoop MapReduce system. The effectiveness of each component is demonstrated through a comprehensive evaluation using representative MapReduce programs from various application domains.
منابع مشابه
MapReduce Programming and Cost-based Optimization? Crossing this Chasm with Starfish
MapReduce has emerged as a viable competitor to database systems in big data analytics. MapReduce programs are being written for a wide variety of application domains including business data processing, text analysis, natural language processing, Web graph and social network analysis, and computational science. However, MapReduce systems lack a feature that has been key to the historical succes...
متن کاملModeling and optimizing MapReduce programs
MapReduce frameworks allow programmers to write distributed, dataparallel programs that operate on multisets. These frameworks offer considerable flexibility to support various kinds of programs and data. To understand the essence of the programming model better and to provide a rigorous foundation for optimizations, we present an abstract, functional model of MapReduce along with a number of c...
متن کاملA What-if Engine for Cost-based MapReduce Optimization
The Starfish project at Duke University aims to provide MapReduce users and applications with good performance automatically, without any need on their part to understand and manipulate the numerous tuning knobs in a MapReduce system. This paper describes the What-if Engine, an indispensable component in Starfish, which serves a similar purpose as a costing engine used by the query optimizer in...
متن کاملCosting Generated Runtime Execution Plans for Large-Scale Machine Learning Programs
Declarative large-scale machine learning (ML) aims at the specification of ML algorithms in a high-level language and automatic generation of hybrid runtime execution plans ranging from single node, in-memory computations to distributed computations on MapReduce (MR) or similar frameworks like Spark. The compilation of large-scale ML programs exhibits many opportunities for automatic optimizati...
متن کاملTraffic Analysis in MapReduce
-MapReduce is a programming model, which can process the large set of data and produces the output. The MapReduce contains two functions to complete the work, those are Map function and Reduce function. The Map function will get assign fragmented data as input and then its emit intermediate data with key and send to this intermediate data with key to the Reducer, where Reducer will get the inpu...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- PVLDB
دوره 4 شماره
صفحات -
تاریخ انتشار 2011